Coreference in Annotating a Large Corpus

Authors

  • Eva Hajičová
  • Jarmila Panevová
  • Petr Sgall
Abstract

The Prague Dependency Treebank (PDT) is a part of the Czech National Corpus, annotated with disambiguated structural descriptions representing the meaning of every sentence in its environment. To achieve that aim, it is necessary, among other things, to make explicit (at least some basic) coreferential relations within the sentence boundaries and also beyond them. The PDT scenario includes both automatic and 'manual' procedures; among the former, there is one that concerns coreference: it indicates the lemma of the subject in a specific attribute of the label belonging to a node for a reflexive pronoun, and assigns the deleted nodes in coordinated constructions the lemmas of their counterparts in the given construction. 'Manual' operations restore nodes for the deleted items, mostly as pronouns. The distinction between grammatical and textual coreference is reflected. To make it possible to handle textual coreference, specific attributes reflect the linking of sentences to each other and to the context of situation, and the development of the degrees of activation of the 'stock of shared knowledge' will be registered insofar as it is derivable from the use of nouns in subsequent utterances in a discourse.

1. Overview of the annotation procedure

1.1. The units of annotation in the PDT are sentences as they occur in the texts of the Czech National Corpus, and the human annotators are instructed to assign every sentence a (disambiguated) structural description according to the meaning of the sentence in its environment. In the manual phase, the annotators are helped by 'user-friendly' software that makes it possible to work with diagrammatic shapes of the trees.
Several parts of the tagging procedure can be formulated as general steps, carried out automatically (see Hajič; Hajičová). One of these parts follows after the dependency structure of the sentence (the nodes of the dependency tree and the syntactic relations indicated by labels of the edges) has been indicated by the annotators. Among other tasks, this module adds certain points concerning coreference:

(i) the lemma of the node carrying the functor value ACT is assigned to the attribute COREF of an occurrence of the reflexive pronoun se that has not yet been treated (i.e. the PAT, Patient or Objective, of an active verb);
(ii) the remaining nodes without lemmas (in coordinated constructions or in apposition) are assigned the lemmas of their counterparts in the given construction; e.g. in Jirka pozval Marii a Karel Milenu (lit. 'Jirka invited Mary and Karel Milena'), the node corresponding to the deleted second occurrence of the verb (which has been added "by hand" as governing both Karel.ACT and Milenu.PAT) gets a lemma identical to that of the left-hand coordinated item.
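The two automatic steps (i) and (ii) can be pictured with a minimal Python sketch. The `Node` class, its field names, and the `counterpart` link are assumptions made for illustration only; the text does not specify the PDT data format or the actual implementation of the module.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node; field names are illustrative, not the PDT format."""
    form: str
    functor: str
    lemma: Optional[str] = None
    coref: Optional[str] = None            # stands in for the attribute COREF
    children: List["Node"] = field(default_factory=list)
    counterpart: Optional["Node"] = None   # assumed link to the coordinated/apposed item


def fill_reflexive_coref(verb: Node) -> None:
    """Step (i): copy the lemma of the ACT dependent into COREF of an untreated 'se'."""
    actor = next((c for c in verb.children if c.functor == "ACT"), None)
    if actor is None:
        return
    for child in verb.children:
        if child.lemma == "se" and child.functor == "PAT" and child.coref is None:
            child.coref = actor.lemma


def fill_missing_lemmas(nodes: List[Node]) -> None:
    """Step (ii): give lemma-less nodes the lemma of their counterpart."""
    for node in nodes:
        if node.lemma is None and node.counterpart is not None:
            node.lemma = node.counterpart.lemma


# Usage for 'Jirka pozval Marii a Karel Milenu': the hand-added second verb
# node has no lemma yet and receives that of the left-hand coordinated verb.
pozval = Node("pozval", "PRED", lemma="pozvat")
restored = Node("", "PRED", counterpart=pozval)
fill_missing_lemmas([pozval, restored])
```

In this sketch the hand-added node is assumed to already carry a pointer to its counterpart; how the real module identifies the counterpart within the construction is not described in the text.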
The annotation on the underlying syntactic layer (the resulting structures being called tectogrammatical tree structures, TGTSs) is carried out in parallel in two streams, both having as their input the result of the automatic preprocessing of the 'analytic' (surface) syntactic trees (in which every word token and every punctuation mark have their corresponding nodes and the basic kinds of dependency relations are specified); for a description of this procedure, see Böhmová and Hajičová. The outputs of these streams differ in the size of the data and the amount of information carried by the tags:

(A) the set of "core" TGTSs (called the 'large corpus', LC) is large and is being annotated at a higher speed, with tags carrying information about (a) the types of dependency relations and (b) values indicating the topic/focus articulation;
(B) the set of "full" TGTSs (the 'model' corpus, MC) is smaller and is being annotated at a lower speed, with tags carrying complete tectogrammatical information (for a detailed characterization of TGTSs, see Hajičová et al. 1999).

1.2. Since one of the aims of the PDT is to serve as a resource for linguistic research beyond the limits of the sentence, three specific attributes have been introduced in the TGTSs, reflecting the linking of sentences to each other and to the context of situation:

(i) the attribute COREF, having as its value the lexical value of the antecedent of the given anaphoric node (this node itself may be present on the surface, or deleted; the resolution of deletions is discussed by Hajičová and Sgall 2000);
(ii) the attribute CORNUM, with a value equal to the serial number of the antecedent of the given node (to avoid uncertainty in case of two occurrences of the same word in the sentence);
(iii) the attribute CORSNT, indicating whether the antecedent is in the same sentence (the value NIL) or in the preceding context (the value PREV).
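As a rough illustration, the three attributes can be modelled as one small record. This is a sketch under stated assumptions: the class `CorefLink` and the helper `link_to_antecedent` are hypothetical names, and only the attribute names COREF, CORNUM, CORSNT and the values NIL and PREV come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CorefLink:
    """Hypothetical container for the three linking attributes of a node."""
    coref: Optional[str] = None     # COREF: lexical value (lemma) of the antecedent
    cornum: Optional[float] = None  # CORNUM: serial number of the antecedent
    corsnt: str = "NIL"             # CORSNT: "NIL" = same sentence, "PREV" = preceding context

    def __post_init__(self) -> None:
        # Only the two values named in the text are accepted here.
        if self.corsnt not in ("NIL", "PREV"):
            raise ValueError(f"unexpected CORSNT value: {self.corsnt!r}")


def link_to_antecedent(lemma: str, serial_number: float,
                       in_same_sentence: bool) -> CorefLink:
    """Build the attribute values for an anaphoric node, given its antecedent."""
    return CorefLink(
        coref=lemma,
        cornum=serial_number,
        corsnt="NIL" if in_same_sentence else "PREV",
    )


# An antecedent 'kluk' at position 4 of the same sentence.
link = link_to_antecedent("kluk", 4, in_same_sentence=True)
```

CORNUM is typed as a float in this sketch because the serial number may be adjusted by decimal fractions when restored nodes precede the antecedent; that choice of type is an assumption, not something the text prescribes.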
If an anaphoric node deleted on the surface is being restored, its lexical value is specified as an anaphoric (weak) pronoun (P in the sequel), a specific lexical value (L), or a technical value (such as Cor for the 'controllee').

1.3. The system of annotation of the TGTSs makes it possible to reflect the distinction between grammatical and textual coreference (see Panevová 1991). Typical examples of the former are the coreference of the subject of the infinitival complementation of control verbs (the subject gets the lexical value Cor) and the coreference of reflexive pronouns (getting L identical to that of the subject), as well as that of relative words in their relationship to their antecedents. With the latter kind of coreference (e.g. the 'deleted' pronominal subjects in Czech as a pro-drop language, or other cases of pronominal reference), the nodes for the anaphoric expressions get P as their lexical value. Nouns, verbs, etc., can also have a coreferential value, which we plan to reflect in the future shape of the procedure (in Czech, nouns in such a position are often accompanied by the pronoun (or determiner) ten 'that'), but we do not discuss these cases in the present paper. In the case of grammatical coreference, the substantial feature of which is the presence of the antecedent in a specified syntactic position of the sentence, an additional attribute ANTEC is used, with a value equal to the dependency relation (functor) of the antecedent.

2. Textual coreference

The textually coreferring node, which either corresponds to a pronoun or is a case of restored deletion, obtains a functor and a P lemma both in the MC and in the LC.
In the MC, its attribute COREF obtains as its value the lemma of the antecedent, and CORNUM gets the value of the serial number of the antecedent (according to its word-order position, adjusted by decimal fractions in case of preceding deletion restorations); in CORSNT the unmarked value NIL is placed automatically, and changed into PREV if the antecedent is in the preceding sentence. In the LC, the attribute COREF is left unfilled, and if the relevant node has been deleted, it is restored only in the case of a zero subject or of another deleted obligatory participant the head of which has not been deleted and is constituted by a deverbal noun or adjective of a fully productive type (as for deletion restoration, cf. Hajičová and Sgall, 2000; it should be noted that a restored node is always marked by the value ELID in one of its attributes). In (1) and (2), we give examples of coreferential zero subjects in the MC (the added nodes are enclosed in square brackets):

(1) Udělal [on.ANIM.SG.ACT.ELID] to.
    'He has done it.'
(2) Byla [ona.FEM.SG.PAT.ELID] předběhnuta několika jinými.
    'She was left behind by some others.'

While with (1) the Gender value is based on intrasentential context (the properties of the verb), with (2) the clue is only present in intersentential context: ona is ambiguous (like the forms byla and předběhnuta, on the basis of agreement with which it has been restored), having also the value 'they', NEUT.PL (e.g. if the neuter noun děvčata 'girls' is the antecedent). With most other pronominal forms the number will be supplied automatically, but Gender and the value of the Functor are filled in manually, which is also necessary if the pronoun has not been deleted; an automatic solution is possible only in certain specific cases, e.g. with a plural noun in the Vocative case accompanying the subject, as in (3), or with the verb-subject agreement disclosing the Gender of the subject, as in (4):

(3) Vy jste, kluci, spali?
    'You, boys, have been sleeping?'
    Vy.ANIM.PL.ACT;COREF:kluk;CORNUM:4 jste, kluci, spali?
(4) My jsme tam byly všechny.
    'We (women, girls) have all been there.'
    My.FEM.PL.ACT jsme tam byly.FEM.PL všechny.

In (4), some other attributes should also be manually assigned their values if there is an antecedent in the previous sentence (otherwise just symbols for empty values are present). It may be recalled that a verb such as prší 'it rains' has no dependent ACT; its valency only admits adverbial adjuncts.

Textual coreference is also understood to cover wider anaphoric relations that do not represent full referential identity, as e.g. in (5), in which oni 'they' is interpreted as referring to a group that includes Anna.

(5) Anna zase nepřišla. Oni všichni často chybějí.
    'Anna failed to turn up again. They all often are absent.'

In the months to come, the automatic procedure is supposed to be enriched in various respects, to cover at least the most regular phenomena of several further subdomains. Directly relevant for textual coreference is that the development of the degrees of activation of the stock of shared knowledge (see Hajičová 1993) will be registered as far as it is derivable from the use of nouns in subsequent utterances in a discourse.

3. Grammatical coreference

With grammatical coreference, the value of COREF is filled in (by the lemma of the controller, the subject or another antecedent, see below), along with the lemma of the coreferring node and its functor, both in the LC and in the MC. In the MC, the values of CORNUM and ANTEC are also added. In CORSNT, the unmarked value NIL remains, since with grammatical coreference the antecedent occurs in the same sentence. The typical cases of grammatical coreference are reflexive and relative pronouns, and 'control':





Publication date: 2000